18 research outputs found

    New scalable machine learning methods: beyond classification and regression

    Get PDF
    Programa Oficial de Doutoramento en Computación. 5009V01

    [Abstract] The recent surge in available data has spawned a new and promising age of machine learning. Success cases of machine learning are arriving at an increasing rate as some algorithms are able to leverage immense amounts of data to produce complex and highly accurate predictions. Still, many algorithms in the toolbox of the machine learning practitioner have been rendered useless in this new scenario due to the complications associated with large-scale learning. Handling large datasets entails logistical problems, limits the computational and spatial complexity of the algorithms used, favours methods with few or no hyperparameters to configure, and exhibits specific characteristics that complicate learning. This thesis is centered on the scalability of machine learning algorithms, that is, their capacity to maintain their effectiveness as the scale of the data grows, and on how that scalability can be improved. We focus on problems for which the existing solutions struggle when the scale grows. Therefore, we skip classification and regression problems and focus on feature selection, anomaly detection, graph construction and explainable machine learning. We analyze four different strategies to obtain scalable algorithms. First, we explore distributed computation, which is used in all of the presented algorithms. Besides this technique, we also examine the use of approximate models to speed up computations, the design of new models that take advantage of a characteristic of the input data to simplify training, and the enhancement of simple models to enable them to manage large-scale learning. We have implemented four new algorithms and six versions of existing ones that tackle the mentioned problems, and for each one we report experimental results that show both their validity in comparison with competing methods and their capacity to scale to large datasets. All the presented algorithms have been made available for download and are being published in journals to enable practitioners and researchers to use them.

    Scalable Feature Selection Using ReliefF Aided by Locality-Sensitive Hashing

    Get PDF
    Funded for open-access publication: Universidade da Coruña/CISUG

    [Abstract] Feature selection algorithms, such as ReliefF, are very important for processing high-dimensionality datasets. However, the widespread use of such popular and effective algorithms is limited by their computational cost. We describe an adaptation of the ReliefF algorithm that simplifies the costliest of its steps by approximating the nearest-neighbor graph using locality-sensitive hashing (LSH). The resulting ReliefF-LSH algorithm can process datasets that are too large for the original ReliefF, a capability further enhanced by a distributed implementation in Apache Spark. Furthermore, ReliefF-LSH obtains better results and is more generally applicable than currently available alternatives to the original ReliefF, as it can handle regression and multiclass datasets. The fact that it requires no additional hyperparameters with respect to ReliefF also avoids costly tuning. A set of experiments demonstrates the validity of this new approach and confirms its good scalability.

    This study has been supported in part by the Spanish Ministerio de Economía y Competitividad (projects PID2019-109238GB-C2 and TIN2015-65069-C2-1-R and 2-R), partially funded by FEDER funds of the EU, and by the Xunta de Galicia (projects ED431C 2018/34 and Centro Singular de Investigación de Galicia, accreditation 2016-2019). The authors wish to thank the Fundación Pública Galega Centro Tecnolóxico de Supercomputación de Galicia (CESGA) for the use of their computing resources. Funding for open access charge: Universidade da Coruña/CISUG.
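    The core trick the abstract describes, replacing ReliefF's exact nearest-neighbor search with candidates drawn from LSH buckets, can be sketched briefly. The following is a minimal single-table, single-neighbor illustration in plain NumPy, not the paper's Spark implementation; the bucket parameters, the L1 distance, and the simplified Relief-style update are assumptions of this sketch, which also assumes features scaled to [0, 1].

```python
import numpy as np
from collections import defaultdict

def relief_lsh(X, y, n_planes=8, seed=0):
    """Relief-style weights with neighbors restricted to one LSH bucket."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_planes))
    codes = (X @ planes > 0).astype(np.uint8)        # random-hyperplane hash bits
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)
    w = np.zeros(X.shape[1])
    for i in range(len(X)):
        cand = np.array([j for j in buckets[codes[i].tobytes()] if j != i])
        if cand.size == 0:                           # lone point in its bucket
            continue
        d = np.abs(X[cand] - X[i]).sum(axis=1)       # L1 distance to candidates
        same = y[cand] == y[i]
        if same.any():                               # nearest hit pulls weights down
            h = cand[same][np.argmin(d[same])]
            w -= np.abs(X[h] - X[i]) / len(X)
        if (~same).any():                            # nearest miss pushes them up
            m = cand[~same][np.argmin(d[~same])]
            w += np.abs(X[m] - X[i]) / len(X)
    return w                                         # higher weight = more relevant
```

    Because each query only scans its own bucket, the quadratic neighbor search drops to roughly linear time for well-spread hashes, which is also what lets a Spark version partition the work across nodes.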

    Regression Tree Based Explanation for Anomaly Detection Algorithm

    Get PDF
    [Abstract] This work presents EADMNC (Explainable Anomaly Detection on Mixed Numerical and Categorical spaces), a novel approach that adds explainability to ADMNC, an anomaly detection algorithm that provides accurate detections on mixed numerical and categorical input spaces. Our extended algorithm leverages the formulation of the ADMNC model to offer pre-hoc explainability based on CART (Classification and Regression Trees). The explanation is presented as a segmentation of the input data into homogeneous groups that can be described with a few variables, offering supervisors novel information with which to justify detections. To prove scalability and interpretability, we report experimental results on large real-world datasets, focusing on the network intrusion detection domain.

    This research was partially funded by European Union ERDF funds, Ministerio de Ciencia e Innovación grant number PID2019-109238GB-C22, and Xunta de Galicia through the accreditation of Centro Singular de Investigación 2016-2020, Ref. ED431G/01, and Grupos de Referencia Competitiva, Ref. GRC2014/035.
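    To make the kind of artifact EADMNC produces concrete: a CART tree segments the inputs into a few homogeneous groups that can be described by their split conditions. The sketch below is not EADMNC, which derives its explanation pre-hoc from the ADMNC model's own formulation; it substitutes a generic detector (scikit-learn's IsolationForest) and fits a surrogate DecisionTreeRegressor to its scores, purely to illustrate tree-based segmentation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest      # stand-in detector, not ADMNC
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
X[:50] += 6                                       # inject a small anomalous group

scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
surrogate = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100).fit(X, scores)

# Each leaf is a segment described by a handful of split conditions;
# leaves with a high mean score are the groups a supervisor should inspect.
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(4)]))
```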

    Sustainable personalisation and explainability in Dyadic Data Systems

    Get PDF
    [Abstract]: Systems that rely on dyadic data, which relate entities of two types, have become ubiquitous in fields such as media services, the tourism business, and e-commerce. However, these systems have tended to be black boxes, despite their objective of influencing people's decisions. There is a lack of research on providing personalised explanations for the outputs of systems that use such data, that is, on integrating the ideas of Explainable Artificial Intelligence into the field of dyadic data. Moreover, the existing approaches rely heavily on Deep Learning models for their training, reducing their overall sustainability. In this work, we propose a computationally efficient model that provides personalisation by generating explanations based on user-created images. In the context of a particular dyadic data system, the restaurant review platform TripAdvisor, we predict, for any (user, restaurant) pair, the review of the restaurant that is most adequate to present to that user, based on their personal preferences. This model exploits efficient Matrix Factorisation techniques combined with feature-rich embeddings from pre-trained Image Classification models, yielding a method capable of providing transparency to dyadic data systems while reducing the carbon emissions of training by as much as 80% compared to alternative approaches.
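    As a rough illustration of the architecture the abstract hints at (all names, shapes, and the bilinear form below are assumptions of this sketch, not the paper's model): a per-user factor learned by matrix factorisation is matched against frozen, pre-trained image embeddings through a small learned projection, so ranking a restaurant's photos for a user costs only a few dot products.

```python
import numpy as np

n_users, d_user, d_img = 1000, 32, 512            # d_img ~ a CNN's feature size
rng = np.random.default_rng(0)
user_factors = rng.normal(size=(n_users, d_user)) # learned by matrix factorisation
proj = rng.normal(size=(d_user, d_img)) * 0.01    # learned projection (assumed)

def rank_images(user_id, image_embeddings):
    """Return indices of candidate images, most adequate for the user first."""
    scores = user_factors[user_id] @ proj @ image_embeddings.T
    return np.argsort(-scores)

imgs = rng.normal(size=(20, d_img))               # 20 candidate photos, pre-embedded
print(rank_images(user_id=7, image_embeddings=imgs)[:3])
```

    Keeping the image encoder frozen is what makes this cheap: only the small factor and projection matrices are trained, which is consistent with the abstract's emphasis on sustainability.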

    Fast anomaly detection with locality-sensitive hashing and hyperparameter autotuning

    Get PDF
    This paper presents LSHAD, an anomaly detection (AD) method based on locality-sensitive hashing (LSH) that is capable of dealing with large-scale datasets. The resulting algorithm is highly parallelizable, and its implementation in Apache Spark further increases its ability to handle very large datasets. Moreover, the algorithm incorporates an automatic hyperparameter tuning mechanism, so users do not have to perform costly manual tuning. Our LSHAD method is novel, as neither hyperparameter automation nor distributed properties are usual in AD techniques. Our results for experiments with LSHAD across a variety of datasets point to state-of-the-art AD performance while handling much larger datasets than state-of-the-art alternatives. In addition, evaluation results for the tradeoff between AD performance and scalability show that our method offers significant advantages over competing methods.

    This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (project PID2019-109238GB-C22) and by the Xunta de Galicia (grants ED431C 2018/34 and ED431G 2019/01) through European Union ERDF funds. CITIC, as a research center accredited by the Galician University System, is funded by the Consellería de Cultura, Educación e Universidades of the Xunta de Galicia, supported 80% through ERDF Funds (ERDF Operational Programme Galicia 2014-2020) and 20% by the Secretaría Xeral de Universidades (Grant ED431G 2019/01). This work was also supported by National Funds through the Portuguese FCT - Fundação para a Ciência e a Tecnologia (projects UIDB/00760/2020 and UIDP/00760/2020).
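    One common way to turn LSH into an anomaly score, and a plausible reading of the abstract's idea (the exact LSHAD mechanics and its hyperparameter autotuning are in the paper and not reproduced here), is to treat bucket occupancy as a cheap density estimate: points that keep landing in sparsely populated buckets across several hash tables are scored as anomalous.

```python
import numpy as np
from collections import Counter

def lsh_anomaly_scores(X, n_tables=10, n_planes=6, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_tables):
        planes = rng.normal(size=(X.shape[1], n_planes))
        codes = [(row @ planes > 0).tobytes() for row in X]
        counts = Counter(codes)
        # A sparse bucket means few similar points nearby: add to the score.
        scores += np.array([1.0 / counts[c] for c in codes])
    return scores / n_tables                     # higher = more anomalous

X = np.vstack([np.random.default_rng(1).normal(size=(2000, 8)),
               np.random.default_rng(2).normal(loc=5, size=(5, 8))])
print(np.argsort(-lsh_anomaly_scores(X))[:5])    # the 5 shifted points should rank first
```

    Each table is a single pass over the data with no pairwise distances, which is why this style of scoring parallelizes naturally in Spark.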

    Sustainable Transparency in Recommender Systems: Bayesian Ranking of Images for Explainability

    Full text link
    Recommender Systems have become crucial in the modern world, commonly guiding users towards relevant content or products and exerting a large influence over the decisions of users and citizens. However, ensuring transparency and user trust in these systems remains a challenge; personalized explanations have emerged as a solution, offering justifications for recommendations. Among the existing approaches for generating personalized explanations, using visual content created by the users is a particularly promising option, with the potential to maximize transparency and user trust. Existing models for explaining recommendations in this context face limitations, however: sustainability has been a critical concern, as they often require substantial computational resources, leading to carbon emissions comparable to those of the Recommender Systems into which they would be integrated. Moreover, most models employ surrogate learning goals that do not align with the objective of ranking the most effective personalized explanations for a given recommendation, leading to a suboptimal learning process and larger model sizes. To address these limitations, we present BRIE, a novel model designed to tackle these challenges by adopting a more adequate learning goal based on Bayesian Pairwise Ranking, enabling it to achieve consistently superior performance to state-of-the-art models on six real-world datasets while exhibiting remarkable efficiency: it emits up to 75% less CO₂ during training and inference, with a model up to 64 times smaller than previous approaches.
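    The learning goal named in the abstract, Bayesian Pairwise Ranking, optimizes the ranking directly: the score of an explanation the user actually authored should exceed that of a sampled alternative. A toy PyTorch training step follows; the embedding sizes and the bilinear scoring model are illustrative assumptions, not BRIE's actual architecture.

```python
import torch

def bpr_loss(pos_scores, neg_scores):
    """BPR objective: -log sigmoid(s_pos - s_neg), averaged over sampled pairs."""
    return -torch.nn.functional.logsigmoid(pos_scores - neg_scores).mean()

user = torch.nn.Embedding(100, 16)              # per-user factors (sizes assumed)
proj = torch.nn.Linear(512, 16, bias=False)     # projects frozen image features
opt = torch.optim.Adam(list(user.parameters()) + list(proj.parameters()), lr=1e-3)

u = torch.randint(0, 100, (64,))                # a batch of users
pos = torch.randn(64, 512)                      # embeddings of authored images
neg = torch.randn(64, 512)                      # embeddings of sampled negatives
loss = bpr_loss((user(u) * proj(pos)).sum(-1), (user(u) * proj(neg)).sum(-1))
opt.zero_grad()
loss.backward()
opt.step()
```

    Because the pairwise objective matches the ranking task, no surrogate regression head is needed, which is one reason such a model can stay small.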

    Los docentes que no han dejado de ser alumnos. Retos y experiencias en dos medios diferentes: online vs presencial

    Get PDF
    In this work, we describe our first teaching experience in two different settings: a face-to-face subject in the Computer Science Degree of the University of A Coruña and an online subject in the Research Master's Degree in Artificial Intelligence of the Menéndez Pelayo International University. Teaching both subjects simultaneously has allowed us to learn the differences between these two modes of instruction. We want to show how we solved the challenges posed by the two subjects, with the aim that the reader can draw on our brief but intense teaching adventures.

    Multithreaded and Spark parallelization of feature selection filters

    Get PDF
    © 2016 Elsevier B.V. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license: https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article has been accepted for publication in the Journal of Computational Science. The Version of Record is available online at https://doi.org/10.1016/j.jocs.2016.07.002

    Final accepted version of: C. Eiras-Franco, V. Bolón-Canedo, S. Ramos, J. González-Domínguez, A. Alonso-Betanzos, and J. Touriño, "Multithreaded and Spark parallelization of feature selection filters", Journal of Computational Science, Vol. 17, Part 3, Nov. 2016, pp. 609-619.

    [Abstract]: Vast amounts of data are generated every day, constituting a volume that is challenging to analyze. Techniques such as feature selection are advisable when tackling large datasets. Among the tools that provide this functionality, Weka is one of the most popular, although the implementations it provides struggle when processing large datasets, requiring impractical amounts of time. Parallel processing can alleviate this problem, effectively allowing users to work with Big Data. The computational power of multicore machines can be harnessed by using multithreading and distributed programming, helping to tackle larger problems. Both techniques can dramatically speed up the feature selection process, allowing users to work with larger datasets. The focus of this work is the reimplementation of four popular feature selection algorithms included in Weka. Multithreaded implementations previously not included in Weka, as well as parallel Spark implementations, were developed for each algorithm. Experimental results obtained from tests on real-world datasets show that the new versions offer significant reductions in processing times.

    This work has been financed in part by the Xunta de Galicia under Research Network R2014/041 and project GRC2014/035, and by the Spanish Ministerio de Economía y Competitividad under projects TIN2012-37954 and TIN2015-65069-C2-1-R, partially funded by FEDER funds of the European Union. V. Bolón-Canedo acknowledges the support of the Xunta de Galicia under postdoctoral grant ED481B 2014/164-0. Additionally, the collaboration of Jorge Veiga in setting up and using the MREv tool for Spark execution was essential for this work.
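    The pattern behind such parallelizations is that filter scores decompose into per-feature sufficient statistics that a single distributed pass can collect. The PySpark fragment below shows that pattern for mutual information on a discretized toy dataset; it is a sketch of the style, not a reproduction of the four Weka filters the paper reimplements.

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-sketch").getOrCreate()
data = spark.sparkContext.parallelize(
    [([0, 1, 1], 0), ([1, 1, 0], 1), ([0, 0, 1], 0), ([1, 0, 0], 1)])

n = data.count()
# Count (feature_index, feature_value, class) triples in a single pass.
triples = (data.flatMap(lambda xy: [((f, v, xy[1]), 1)
                                    for f, v in enumerate(xy[0])])
               .reduceByKey(lambda a, b: a + b).collectAsMap())

def mutual_info(f):
    """MI between feature f and the class, from the aggregated triple counts."""
    pvc = {(v, c): k / n for (fi, v, c), k in triples.items() if fi == f}
    pv, pc = {}, {}
    for (v, c), p in pvc.items():
        pv[v] = pv.get(v, 0) + p
        pc[c] = pc.get(c, 0) + p
    return sum(p * math.log(p / (pv[v] * pc[c])) for (v, c), p in pvc.items())

print(sorted(range(3), key=mutual_info, reverse=True))   # features, best first
spark.stop()
```

    Only the compact count table travels to the driver; the raw rows stay partitioned across the cluster, which is where the speedups on large datasets come from.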

    Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019)

    Get PDF
    [Abstract] The aim of this work is to propose different statistical and machine learning methodologies for identifying anomalies and controlling the quality of energy efficiency and hygrothermal comfort in buildings. Companies in the building energy sector are interested in statistical and machine learning tools that automate the control of energy consumption and ensure the quality of Heating, Ventilation and Air Conditioning (HVAC) installations. Consequently, a methodology based on the application of the Local Correlation Integral (LOCI) anomaly detection technique has been proposed. In addition, the variables most critical for anomaly detection are identified using the ReliefF method. Once vectors of critical variables are obtained, multivariate and univariate control charts can be applied to control the quality of HVAC installations (consumption, thermal comfort). To test the proposed methodology, the companies involved in this project have provided a case study of a clothing-brand store located in a shopping center in Panama. It is important to note that this is a controlled case study for which all the anomalies had previously been identified by maintenance personnel. Moreover, as an alternative solution, new nonparametric control charts for functional data based on data depth have been proposed and applied to curves of daily energy consumption in HVAC.

    Ministerio de Asuntos Económicos y Transformación Digital; MTM2014-52876-R. Ministerio de Asuntos Económicos y Transformación Digital; MTM2017-82724-R. Xunta de Galicia; ED431C-2016-015. Centro Singular de Investigación de Galicia; ED431G/01 2016-19. Centro de Investigación en Tecnoloxías da Información e as Comunicacións da Universidade da Coruña; PC18/03. Escuela Politécnica Nacional of Ecuador; PII-DM-002-201
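    As one concrete piece of the pipeline described above, once the critical variables are selected, a multivariate control chart can flag out-of-control observations. The sketch below uses a Hotelling T² statistic with a chi-square limit, a common large-sample choice for multivariate charts; the LOCI detector, the ReliefF stage, and the functional-data charts are not reproduced, and the data are invented.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
calib = rng.normal(size=(500, 3))        # in-control readings of 3 selected variables
mu = calib.mean(axis=0)
S_inv = np.linalg.inv(np.cov(calib, rowvar=False))

def t2(x):
    """Hotelling T^2 distance of one observation from the in-control mean."""
    d = x - mu
    return float(d @ S_inv @ d)

limit = chi2.ppf(0.999, df=3)            # upper control limit (chi-square approx.)
new = np.array([4.5, 0.1, -3.8])         # a suspect consumption/comfort vector
score = t2(new)
print(score, "out of control" if score > limit else "in control")
```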

    Análisis fisiológico de las tareas de entrenamiento en fútbol sala

    Get PDF
    It is important to be able to accurately monitor training load during futsal drills intended for physical development, to allow the optimization of training parameters. The aim of this study was to obtain a conditional profile of futsal drills, analyzing them in terms of five variables: intervention time, duration, maximum heart rate, mean heart rate, and blood lactate concentration. Eight professional futsal players were assessed over a total sample of 70 drills grouped into 8 commonly used subcategories. Statistical analysis was performed with SPSS 20.0 and comprised general descriptive statistics and two one-way ANOVAs with Bonferroni correction. The results showed that real-game exercises did not reach the physiological load of matches. Furthermore, speed-endurance drills produced higher lactate concentrations than the other training activities. Finally, transition, mobility, full-field, 4x4, and fly-goalkeeper drills had similar conditional characteristics, close to mixed endurance and the anaerobic threshold. In conclusion, the analyzed drills are adequate for developing the metabolic pathways essential in futsal.